Genetic Epidemiology

Wiley

All preprints, ranked by how well they match Genetic Epidemiology's content profile, based on 46 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Statistical Approach Leveraging Genealogies of Populations with a Founder Effect and Identical by Descent Segments to Identify Rare Variants in Complex Diseases

Bureau, A.; Girard, S.; Moreau, C.; Maziade, M.; Oubninte, S.

2025-09-18 epidemiology 10.1101/2025.09.16.25335588 medRxiv
Top 0.1%
38.9%

The missing heritability caused by rare variants (RVs) poses a significant challenge to pre-established statistical methods. Our study aims at detecting RVs using identical-by-descent (IBD) segments as a proxy for recent variants in family data from a population with a founder effect for which genealogy is available, a distinguishing feature of our approach. Inferring IBD segments from genotype array data, which is more accessible than whole genome sequences, enables application to large sample sizes. Our approach involves dividing the genome into fixed-length windows, treating each window as a synthetic genomic region (SG), and then identifying groups of affected individuals sharing a specific IBD segment over an SG by analyzing genotype array data to infer pairwise IBD segments. Data from pairwise IBD segments are then used to identify densely connected haplotypes as IBD clusters via DASH. Lastly, we adapt, implement, and evaluate statistics to test for IBD sharing enrichment among affected individuals within SGs. The null distribution of the genome-wide maximal statistic value is obtained by simulating whole-genome transmission in a genealogy using msprime. For application purposes, Eastern Quebec has been studied as an example of a population with a founder effect. Using the BALSAC database to reconstruct the genealogy of 1,200 subjects across 48 schizophrenia and bipolar disorder multi-generational families led to an 18-generation pedigree with 84% completeness at the 10th generation. The statistic denoted as Smsg for the "most shared haplotype in an SG" exhibits superior power in detecting causal SGs when compared to the adapted Sall measure and (with a single causal variant in a region) to GMMAT (Generalized Linear Mixed Model Association Test) applied to IBD clusters.
Our analysis of data pertaining to schizophrenia and bipolar disorder reveals no regions that surpass the conventional significance thresholds for harboring rare variants associated with these disorders. Two distinct regions, on chromosomes 5 and 11, stand out due to their maximal Smsg values. These findings underscore the potential of leveraging genealogical data and IBD segments to uncover rare variants in complex diseases.

2
Case-base-control designs

Elhezzani, N. S.; Bergsma, W.; Weale, M.

2019-08-02 genetics 10.1101/723452 medRxiv
Top 0.1%
32.7%

Most genome-wide association studies (GWASs) use randomly selected samples from the population (hereafter bases) as the control set. This approach is successful when the trait of interest is rare; otherwise, a loss in the statistical power to detect disease-associated variants is expected. To address this, a design that combines the three sample types (cases, controls, and bases) is proposed for instances when the disease under study is prevalent. This is done by modelling the bases as a mixture of multinomial logistic functions of cases and controls, according to the disease prevalence. The underlying parameters are estimated by maximum likelihood using the EM algorithm. Three classical tests of association (the score, Wald, and likelihood ratio tests) are derived, and their power to detect genetic associations under different designs is compared. Simulations show that combining the three samples can increase the power to detect disease-associated variants, though a very large base sample set can compensate for the lack of controls.

3
Bayestrat: Population Stratification Correction Using Bayesian Shrinkage Prior for Genetic Association Studies

Liu, Z.; Turkmen, A.; Lin, S.

2021-03-24 genetics 10.1101/2021.03.23.436705 medRxiv
Top 0.1%
32.5%

In genetic association studies with common diseases, population stratification is a major source of confounding. Principal component regression (PCR) and linear mixed model (LMM) are two commonly used approaches to account for population stratification. Previous studies have shown that LMM can be interpreted as including all principal components (PCs) as random-effect covariates. However, including all PCs in LMM may inflate type I error in some scenarios due to redundancy, while including only a few pre-selected PCs in PCR may fail to fully capture the genetic diversity. Here, we propose a statistical method under the Bayesian framework, Bayestrat, that utilizes appropriate shrinkage priors to shrink the effects of non- or minimally confounded PCs and improve the identification of highly confounded ones. Simulation results show that Bayestrat consistently achieves lower type I error rates yet higher power, especially when the number of PCs included in the model is large. We also apply our method to two real datasets, the Dallas Heart Studies (DHS) and the Multi-Ethnic Study of Atherosclerosis (MESA), and demonstrate the superiority of Bayestrat over commonly used methods.

4
A framework for detecting causal effects of risk factors at an individual level based on principles of Mendelian randomization: Applications to modelling individualized effects of lipids on coronary artery disease

SHI, Y.; Xiang, Y.; YE, Y.; HE, T.; SHAM, P.-C.; So, H.-C.

2024-01-20 epidemiology 10.1101/2024.01.18.24301507 medRxiv
Top 0.1%
28.6%

Mendelian Randomization (MR), a method that employs genetic variants as instruments for causal inference, has gained popularity in assessing the causal effects of risk factors. However, almost all MR studies concentrate on the population's average causal effects. With the advent of precision medicine, the individualized treatment effect (ITE) is often of greater interest. For instance, certain risk factors may pose a higher risk to some individuals compared to others, and the benefits of a treatment may vary among individuals. This highlights the importance of considering individual differences in risk and treatment response. We propose a new framework that expands the concept of MR to investigate individualized causal effects. We present several approaches for estimating ITEs within this MR framework, primarily grounded on the principles of the "R-learner". To evaluate the existence of causal effect heterogeneity, we propose two permutation testing methods. We employ Polygenic Risk Scores (PRS) as the instrument and demonstrate that the removal of potentially pleiotropic SNPs can enhance the accuracy of ITE estimates. The validity of our approach was substantiated through comprehensive simulations. We applied our framework to study the individualized causal effect of various lipid traits, including Low-Density Lipoprotein Cholesterol (LDL-C), High-Density Lipoprotein Cholesterol (HDL-C), Triglycerides (TG), and Total Cholesterol (TC), on the risk of Coronary Artery Disease (CAD) using data from the UK Biobank. Our findings indicate that an elevated level of LDL-C is causally linked to increased CAD risks, with the effect demonstrating significant heterogeneity. Similar results were observed for TC. We also identified clinical factors contributing to the heterogeneity of ITEs through Shapley value analysis. This underscores the importance of individualized treatment plans in managing CAD risks.

5
eMAGMA: An eQTL-informed method to identify risk genes using genome-wide association study summary statistics

Gerring, Z. F.; Mina-Vargas, A.; Derks, E. M.

2019-11-25 genetics 10.1101/854315 medRxiv
Top 0.1%
28.5%

Identifying genes underlying genetic associations of complex disease is challenging because most common risk variants reside in non-protein coding regions of the genome and likely alter the expression of target genes by disrupting tissue and cell-type specific regulatory elements. To address this challenge, we developed a methodological framework, eQTL-MAGMA (eMAGMA), that converts SNP-level summary statistics into gene-level association statistics by assigning non-coding SNPs to their putative genes based on tissue-specific eQTL information. We compared eMAGMA to three eQTL-informed gene-based approaches (S-PrediXcan, FUSION, and SMR) using simulated phenotype data. Phenotypes were simulated based on eQTL reference data using GCTA for all genes with at least one eQTL on chromosome 1 (651 genes). We performed 10 simulations per gene. The eQTL-h2 (i.e., the proportion of variation explained by the eQTLs) was set at 1%, 2%, and 5%. We found eMAGMA outperforms other gene-based approaches across a range of simulated parameters (e.g. the number of identified causal genes). When applied to genome-wide association summary statistics for major depression, eMAGMA identified substantially more putative candidate causal genes compared to other eQTL-based approaches. By integrating tissue-specific eQTL information, these results show eMAGMA will help to identify novel candidate causal genes from genome-wide association summary statistics and thereby improve the understanding of the biological basis of complex disorders.

6
Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets

Park, D. K.; Chen, M.; Kim, S.; Joo, Y. Y.; Loving, R.; Kim, H.-S.; Cha, J.; Yoo, S.; Kim, J. H.

2022-01-22 genomics 10.1101/2022.01.19.476997 medRxiv
Top 0.1%
28.4%

Recently, the polygenic risk score (PRS) has gained significant attention in studies involving complex genetic diseases and traits. PRS is often derived from summary statistics, from which the independence between discovery and replication sets cannot be monitored. Prior studies, in which the independence is strictly observed, report a relatively low gain from PRS in predictive models of binary traits. We hypothesize that the independence assumption may be compromised when using the summary statistics, and suspect an overestimation bias in the predictive accuracy. To demonstrate the overestimation bias in the replication dataset, prediction performances of PRS models are compared when overlapping subjects are either present or removed. We consider the task of Alzheimer's disease (AD) prediction across genetics datasets, including the International Genomics of Alzheimer's Project (IGAP), AD Sequencing Project (ADSP), and Accelerating Medicine Partnership - Alzheimer's Disease (AMP-AD). PRS is computed from either sequencing studies for ADSP and AMP-AD (denoted as rPRS) or the summary statistics for IGAP (sPRS). Two variables with high heritability in UK Biobank, hypertension and height, are used to derive an exemplary scale effect of PRS. Based on the scale effect, the expected performance of sPRS is computed for AD prediction. Using ADSP as a discovery set for rPRS on AMP-AD, ΔAUC and ΔR² (performance gains in AUC and R² by PRS) record 0.069 and 0.11, respectively. Both drop to 0.0017 and 0.0041 once overlapping subjects are removed from AMP-AD. sPRS is derived from IGAP, which records ΔAUC and ΔR² of 0.051 ± 0.013 and 0.063 ± 0.015 for ADSP and 0.060 and 0.086 for AMP-AD, respectively. On UK Biobank, rPRS performances for hypertension assuming a similar size of discovery and replication sets are 0.0036 ± 0.0027 (ΔAUC) and 0.0032 ± 0.0028 (ΔR²). For height, ΔR² is 0.029 ± 0.0037.
Considering the high heritability of hypertension and height in UK Biobank, we conclude that sPRS results from AD databases are inflated. Higher performances relative to the size of the discovery set have been observed in PRS studies of several diseases. PRS performances for binary traits, such as AD and hypertension, turned out unexpectedly low. This may, along with the difference in linkage disequilibrium, explain the high variability of PRS performances in cross-nation or cross-ethnicity applications, i.e., when there are no overlapping subjects. Hence, for sPRS, potential duplications should be carefully considered within the same ethnic group.

7
An Optimally Weighted Combination Method to Detect Novel Disease Associated Genes Using Publicly Available GWAS Summary Data

Zhang, J.; Gonzales, S.; Liu, J.; Gao, X. R.; wang, x.

2019-07-20 genetics 10.1101/709808 medRxiv
Top 0.1%
26.0%

Gene-based analyses offer a useful alternative and complement to the usual single nucleotide polymorphism (SNP) based analysis for genome-wide association studies (GWASs). Using appropriate weights (pre-specified or eQTL-derived) can boost statistical power, especially for detecting weak associations between a gene and a trait. Because the sparsity level or association directions of the underlying association patterns in real data are often unknown and access to individual-level data is limited, we propose an optimal weighted combination (OWC) test applicable to summary statistics from GWAS. This method includes burden tests, weighted sum of squared score (SSU), weighted sum statistic (WSS), and the score test as its special cases. We analytically prove that aggregating the variants in one gene is the same as using the weighted combination of Z-scores for each variant based on the score test method. We also numerically illustrate that our proposed test outperforms several existing comparable methods via simulation studies. Lastly, we use schizophrenia GWAS data and fasting glucose GWAS meta-analysis data to demonstrate that our method outperforms the existing methods in real data analyses. Our proposed test is implemented in the R program OWC, which is freely and publicly available.
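
The idea of a weighted combination of per-variant Z-scores can be illustrated with a minimal sketch. This is a generic burden-style statistic, not the OWC test itself: it assumes that, under the null, the Z-scores of the variants in a gene are jointly normal with correlation equal to the LD matrix, so the weighted sum divided by its standard deviation is standard normal. The function name, weights, and LD values are hypothetical.

```python
import math


def weighted_z_combination(z_scores, weights, ld_matrix):
    """Combine per-variant Z-scores into one gene-level statistic.

    Assumes Z ~ N(0, R) under the null, where R is the LD (correlation)
    matrix, so T = w'Z / sqrt(w'Rw) is standard normal under the null.
    Returns (T, two-sided p-value).
    """
    n = len(z_scores)
    if len(weights) != n or len(ld_matrix) != n or any(len(r) != n for r in ld_matrix):
        raise ValueError("z_scores, weights and ld_matrix must have matching sizes")

    # Weighted sum of the Z-scores (numerator of the statistic).
    numerator = sum(w * z for w, z in zip(weights, z_scores))

    # Variance of the weighted sum under the null: w' R w.
    variance = sum(
        weights[i] * ld_matrix[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    if variance <= 0:
        raise ValueError("w'Rw must be positive")

    stat = numerator / math.sqrt(variance)
    # Two-sided p-value for a standard normal statistic.
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


# Hypothetical example: two variants with equal weights.
independent = [[1.0, 0.0], [0.0, 1.0]]
perfect_ld = [[1.0, 1.0], [1.0, 1.0]]
stat_indep, p_indep = weighted_z_combination([2.0, 2.0], [1.0, 1.0], independent)
stat_ld, p_ld = weighted_z_combination([2.0, 2.0], [1.0, 1.0], perfect_ld)
```

With independent variants the two signals reinforce each other, whereas under perfect LD the second variant adds no information and the combined statistic equals the single-variant Z-score.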

8
On a Unifying ‘Reverse’ Regression for Robust Association Studies and Allele Frequency Estimation with Related Individuals

Zhang, L.; Sun, L.

2019-06-04 genetics 10.1101/470328 medRxiv
Top 0.1%
25.8%

For genetic association studies with related individuals, the standard linear mixed-effect model is the most popular approach. The model treats a complex trait (phenotype) as the response variable and a genetic variant (genotype) as a covariate. An alternative approach is to reverse the roles of phenotype and genotype. This class of tests includes quasi-likelihood based score tests. In this work, after reviewing these existing methods, we propose a general, unifying reverse regression framework. We then show that the proposed method can also explicitly adjust for potential departure from Hardy-Weinberg equilibrium. Lastly, we demonstrate the additional flexibility of the proposed model on allele frequency estimation, as well as its connection with earlier work on the best linear unbiased allele-frequency estimator. We conclude the paper with supporting evidence from simulation and application studies.

9
A Kernel Method for Dissecting Genetic Signals in Tests of High-Dimensional Phenotypes

Solis-Lemus, C.; Holleman, A. M.; Todor, A.; Bradley, B.; Ressler, K. J.; Ghosh, D.; Epstein, M.

2021-07-30 genomics 10.1101/2021.07.29.454336 medRxiv
Top 0.1%
25.7%

Genomewide association studies increasingly employ multivariate tests of multiple correlated phenotypes to exploit likely pleiotropy to improve power. Typical multivariate methods produce a global p-value of association between a variant (or set of variants) and multiple phenotypes. When the global test is significant, subsequent interest then focuses on dissecting the signal and, in particular, delineating the set of phenotypes where the genetic variant(s) have a direct effect from the remaining phenotypes where the genetic variant(s) possess either indirect or no effect. While existing techniques like mediation models can be utilized for this purpose, they generally cannot handle high-dimensional phenotypic and genotypic data. To assist in filling this important gap, we propose a modification of a kernel distance-covariance framework for gene mapping of multiple variants with multiple phenotypes to test instead whether the association between the variants and a group of phenotypes is driven through a direct association with just a subset of the phenotypes. We use simulated data to show that our new method controls for type I error and is powerful to detect a variety of models demonstrating different patterns of direct and indirect effects. We further illustrate our method using GWAS data from the Grady Trauma Project and show that an existing signal between genetic variants in the ZHX2 gene and 21 items within the Beck Depression Inventory appears to be due to a direct effect of these variants on only 3 of these items. Our approach scales to genomewide analysis, and is applicable to high-dimensional correlated phenotypes.

10
Pathway Polygenic Risk Scores (pPRS) for the Analysis of Gene-environment Interaction

Gauderman, W. J.; Fu, Y.; Quem, B.; Kawaguchi, E.; Wang, Y.; Morrison, J.; Brenner, H.; Chan, A.; Gruber, S.; Temitope, K.; Li, L.; Moreno, V.; Pellatt, A.; Peters, U.; Samadder, N. J.; Schmit, S.; Ulrich, C.; Um, C.; Wu, A.; Lewinger, J. P.; Mi, H.; Drew, D.

2024-12-20 genetics 10.1101/2024.12.16.628610 medRxiv
Top 0.1%
25.5%

A polygenic risk score (PRS) is used to quantify the combined disease risk of many genetic variants. For complex human traits there is interest in determining whether the PRS modifies, i.e. interacts with, important environmental (E) risk factors. Detection of a PRS by environment (PRS x E) interaction may provide clues to underlying biology and can be useful in developing targeted prevention strategies for modifiable risk factors. The standard PRS may include a subset of variants that interact with E but a much larger subset of variants that affect disease without regard to E. This latter subset will water down the underlying signal in the former subset, leading to reduced power to detect PRS x E interaction. We explore the use of pathway-defined PRS (pPRS) scores, using state-of-the-art tools to annotate subsets of variants to genomic pathways. We demonstrate via simulation that testing targeted pPRS x E interaction can yield substantially greater power than testing overall PRS x E interaction. We also analyze a large study (N=78,253) of colorectal cancer (CRC) where E = non-steroidal anti-inflammatory drugs (NSAIDs), a well-established protective exposure. While no evidence of overall PRS x NSAIDs interaction (p=0.41) is observed, a significant pPRS x NSAIDs interaction (p=0.0003) is identified based on SNPs within the TGF-β / gonadotropin releasing hormone receptor (GRHR) pathway. NSAID use is protective (OR=0.84) for those at the 5th percentile of the TGF-β/GRHR pPRS (low genetic risk), but significantly more protective (OR=0.70) for those at the 95th percentile (high genetic risk). From a biological perspective, this suggests that NSAIDs may act to reduce CRC risk specifically through genes in these pathways. From a population health perspective, our result suggests that focusing on genes within these pathways may be effective at identifying those for whom NSAIDs-based CRC-prevention efforts may be most effective.
Author Summary: The identification of polygenic risk score (PRS) by environment (PRS x E) interactions may provide clues to underlying biology and facilitate targeted disease prevention strategies. The standard approach to computing a PRS likely includes many variants that affect disease without regard to E, reducing power to detect PRS x E interactions. We utilize gene annotation tools to develop pathway-based PRS (pPRS) scores and show by simulation studies that testing pPRS x E interaction can yield substantially greater power than testing PRS x E, while also integrating biological knowledge into the analysis. We apply our method to a large study of colorectal cancer to identify a significant pPRS x NSAIDs interaction (p=0.0003) based on SNPs within the TGF-β / gonadotropin releasing hormone receptor (GRHR) pathway. Our findings suggest that focusing on genetic susceptibility within biologically informed pathways may be more sensitive for identifying exposures that can be considered as part of a precision prevention approach.
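
The difference between an overall PRS and a pathway PRS can be sketched in a few lines, assuming the usual definition of a PRS as a weighted sum of allele dosages. The SNP identifiers, effect sizes, and pathway membership below are hypothetical, and the sketch covers only the score construction, not the annotation tools or the interaction test described in the abstract.

```python
def polygenic_score(dosages, effect_sizes, snp_subset=None):
    """Weighted sum of allele dosages (0, 1 or 2 copies of the effect allele).

    dosages:      dict mapping SNP id -> dosage for one individual
    effect_sizes: dict mapping SNP id -> per-allele weight (e.g. log odds ratio)
    snp_subset:   optional iterable of SNP ids; if given, only these SNPs
                  contribute, which yields a pathway-restricted score (pPRS)
    """
    snps = effect_sizes.keys() if snp_subset is None else snp_subset
    return sum(
        effect_sizes[snp] * dosages[snp]
        for snp in snps
        if snp in effect_sizes and snp in dosages
    )


# Hypothetical weights and one individual's genotype dosages.
effect_sizes = {"rs_a": 0.10, "rs_b": 0.05, "rs_c": 0.20}
dosages = {"rs_a": 2, "rs_b": 1, "rs_c": 1}

# Hypothetical annotation: only rs_a and rs_c map to the pathway of interest.
pathway_snps = {"rs_a", "rs_c"}

overall_prs = polygenic_score(dosages, effect_sizes)
pathway_prs = polygenic_score(dosages, effect_sizes, pathway_snps)
```

In an interaction analysis, either score would then enter a regression model together with the exposure E and a score-by-E product term; restricting the score to pathway SNPs is what removes variants that act without regard to E.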

11
Shrinkage Parameter Estimation in Penalized Logistic Regression Analysis of Case-Control Data

Yu, Y.; Chen, S.; McNeney, B.

2021-02-14 genetics 10.1101/2021.02.12.430986 medRxiv
Top 0.1%
23.4%

Introduction: Increasingly, logistic regression methods for genetic association studies of binary phenotypes must be able to accommodate data sparsity, which arises from unbalanced case-control ratios and/or rare genetic variants. Sparseness leads to maximum likelihood estimators (MLEs) of log-OR parameters that are biased away from their null value of zero and tests with inflated type 1 errors. Different penalized-likelihood methods have been developed to mitigate sparse-data bias. We study penalized logistic regression using a class of log-F priors indexed by a shrinkage parameter m to shrink the biased MLE towards zero. Methods: We propose a two-step approach to the analysis of a genetic association study: first, a set of variants that show evidence of association with the trait is used to estimate m; and second, the estimated m is used for log-F-penalized logistic regression analyses of all variants using data augmentation with standard software. Our estimate of m is the maximizer of a marginal likelihood obtained by integrating the latent log-ORs out of the joint distribution of the parameters and observed data. We consider two approximate approaches to maximizing the marginal likelihood: (i) a Monte Carlo EM algorithm (MCEM) and (ii) a Laplace approximation (LA) to each integral, followed by derivative-free optimization of the approximation. Results: We evaluate the statistical properties of our proposed two-step method and compare its performance to other shrinkage methods by a simulation study. Our simulation studies suggest that the proposed log-F-penalized approach has lower bias and mean squared error than other methods considered. We also illustrate the approach on data from a study of genetic associations with "super senior" cases and middle-aged controls. Discussion/Conclusion: We have proposed a method for single rare variant analysis with binary phenotypes by logistic regression penalized by log-F priors.
Our method has the advantage of being easily extended to correct for confounding due to population structure and genetic relatedness through a data augmentation approach.

12
Mathematical bounds on r2 and the effect size in case-control genome-wide association studies

Paye, S. M.; Edge, M. D.

2024-12-17 genetics 10.1101/2024.12.17.628943 medRxiv
Top 0.1%
23.2%

Case-control genome-wide association studies (GWAS) are often used to find associations between genetic variants and diseases. When case-control GWAS are conducted, researchers must make decisions regarding how many cases and how many controls to include in the study. Depending on differing availability and cost of controls and cases, varying case fractions are used in case-control GWAS. Connections between variants and diseases are made using association statistics, including χ². Previous work in population genetics has shown that LD statistics, including r², are bounded by the allele frequencies in the population being studied. Since varying the case fraction changes sample allele frequencies, we use the known bounds on r² to explore how variation in the fraction of cases included in a study can impact statistical power to detect associations. We analyze a simple mathematical model and use simulations to study a quantity proportional to the χ² noncentrality parameter, which is closely related to r², under various conditions. Varying the case fraction changes the χ² noncentrality parameter, and by extension the statistical power, with effects depending on the dominance, penetrance, and frequency of the risk allele. Our framework explains previously observed results, such as asymmetries in power to detect risk vs. protective alleles, and the fact that a balanced sample of cases and controls does not always give the best power to detect associations, particularly for highly penetrant minor risk alleles that are either dominant or recessive. We show by simulation that our results can be used as a rough guide to statistical power for association tests other than χ² tests of independence.

13
Incorporating discovery and replication GWAS into summary data Mendelian randomization studies: A review of current methods and a simple, general and powerful alternative

Mounier, N.; Robertson, D. S.; Kutalik, Z.; Dudbridge, F.; Bowden, J.

2023-01-13 genetics 10.1101/2023.01.12.523708 medRxiv
Top 0.1%
23.1%

Mendelian Randomization (MR) is a popular method for using genetics to estimate the causal effect of a modifiable exposure on a health outcome. Single Nucleotide Polymorphisms (SNPs) are typically selected for inclusion if they pass a genome-wide significance threshold in order to guarantee that they are strong genetic instruments, but this also induces winner's curse, as SNP-exposure associations tend to be overestimated. In this paper, we consider how to combine SNP-exposure data from discovery and replication samples using two-sample and three-sample approaches to best account for winner's curse, weak instrument bias, and pleiotropy within a summary data MR framework, using only GWAS summary statistics. After reviewing several existing methods, which often correct for winner's curse at the individual SNP level, we propose a simple alternative based on the technique of regression calibration that enacts a global correction to the causal effect estimate directly. This approach not only corrects for winner's curse, but also simultaneously accounts for weak instrument bias. Regression calibration can be used with a wide range of existing MR methods, including pleiotropy-robust methods such as median-based and mode-based estimators. Extensive simulations and real data examples are used to illustrate the utility of the new approach. Software is provided for users to implement the method in practice. Author Summary: Mendelian randomization is a method to explore causation in health research which exploits the random inheritance of genes from parents to offspring as a natural experiment. It attempts to quantify the effect of intervening and modifying a health exposure, such as a person's body mass, on a downstream outcome such as blood pressure. Causal estimates obtained using this method can be strongly influenced by the set of genes used, or more specifically, the rationale used to select them.
For example, selecting only genes that are strongly associated with the health exposure can induce bias due to the winner's curse. Unfortunately, using genes with a small association can lead to so-called weak instrument bias, leading to a no-win paradox. In this paper, we present a novel approach based on the technique of regression calibration to de-bias causal estimates in an MR study. Our approach relies on the use of two independent samples for the exposure (discovery and replication) to estimate the amount of bias that is expected for a specific set of genes, so that causal estimates can be re-calibrated accordingly. We use extensive simulations and applied examples to compare our approach to current methods and provide software for researchers to implement our approach in future studies.

14
Using Negative Control Outcomes to Detect Selection Bias in Mendelian Randomization Studies

Gkatzionis, A.; Davey Smith, G.; Tilling, K.

2026-02-01 epidemiology 10.64898/2026.01.30.26345215 medRxiv
Top 0.1%
23.0%

Mendelian randomization is currently mainly implemented through the use of genetic variants as instrumental variables to investigate the causal effect of an exposure on an outcome of interest. Mendelian randomization studies are robust to confounding bias and reverse causation, but they remain susceptible to selection bias; for example, this can happen if the exposure or outcome are associated with selection into the study sample. Negative controls are sometimes used to detect biases (typically due to confounding) in observational studies. Here, we focus specifically on Mendelian randomization analyses and discuss under what conditions a variable can be used as a negative control outcome to detect selection mechanisms that could bias Mendelian randomization estimates. We show that the main requirement is that the negative control outcome relates to confounders of the exposure and outcome. Counter-intuitively, the effect of the negative control on selection is of secondary concern; for example, a variable that does not affect selection can be a valid negative control for an outcome that does. We also investigate under what conditions age and sex can be used as negative control outcomes in Mendelian randomization analyses. In a real-data application, we investigate the pairwise causal relationships between 19 traits, utilizing data from the UK Biobank. Treating biological sex as a negative control outcome, we identify selection bias in analyses involving commonly used traits such as alcohol consumption, body mass index and educational attainment.

15
Adjusting for medication status in genome-wide association studies

Chong, A. H. W.; Kintu, C.; Cho, Y.; Fatumo, S.; Torres, J.; Davey Smith, G.; Gaunt, T. R.; Hemani, G.

2024-02-20 epidemiology 10.1101/2024.02.19.24303028 medRxiv
Top 0.1%
22.9%

When conducting genome-wide association studies, improper handling of medication status that is relevant to the trait of interest can induce biases by opening up different pathways that distort estimates of the true effect. Here, we propose the genetic empirical medication reduction adjustment (GEMRA) method, which uses a heuristic search for an empirical adjustment to be applied to phenotypic values of participants reporting medication use. Through simulations we show that the direct genetic effect estimates in the GEMRA approach exhibited less bias and greater statistical power than either restricting the sample to unmedicated participants or including all samples without adjustment. We then applied the GEMRA approach to estimate statin medication adjustment for analysis of LDL cholesterol levels, using multi-ancestry data from UK Biobank and the Uganda Genome Resource. We found that a relative rather than an absolute adjustment better modelled the effect of medication on LDL cholesterol, with an effect of 40% reduction appearing to be consistent across ancestral groups. These findings are consistent with the current clinical guidelines.
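
The distinction between a relative and an absolute medication adjustment can be made concrete with a short sketch. It illustrates only how an already chosen adjustment might be applied to measured values, not the GEMRA heuristic search itself. It assumes that a relative reduction of 40% means a medicated participant's measured value is 60% of the untreated value, so the untreated value is recovered by dividing by 0.6; the function names and example values are hypothetical.

```python
def relative_adjustment(value, on_medication, reduction):
    """Undo a proportional medication effect.

    Assumes measured = untreated * (1 - reduction) for medicated participants,
    so untreated = measured / (1 - reduction). Unmedicated values are unchanged.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return value / (1 - reduction) if on_medication else value


def absolute_adjustment(value, on_medication, shift):
    """Undo a constant medication effect by adding a fixed amount back."""
    return value + shift if on_medication else value


# Hypothetical LDL cholesterol measurements (mmol/L) and medication status.
measurements = [(3.0, True), (2.4, True), (3.5, False)]

relative = [relative_adjustment(v, med, 0.40) for v, med in measurements]
absolute = [absolute_adjustment(v, med, 1.5) for v, med in measurements]
```

Under the relative model, participants with higher measured values receive a larger correction, whereas the absolute model adds the same amount to every medicated participant regardless of their level.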

16
A comparison of the genes and genesets identified by EWAS and GWAS of fourteen complex traits

Battram, T.; Gaunt, T. R.; Relton, C. L.; Timpson, N. J.; Hemani, G.

2022-03-25 epidemiology 10.1101/2022.03.25.22272928 medRxiv
Top 0.1%
22.9%

Identifying the genes, properties of these genes and pathways to understand the underlying biology of complex traits responsible for differential health states in the population is a common goal of epigenome-wide and genome-wide association studies (EWAS and GWAS). GWAS identify genetic variants that affect the trait of interest or variants that are in linkage disequilibrium with the true causal variants. EWAS identify variation in DNA methylation, a complex molecular phenotype, associated with the trait of interest. Therefore, while GWAS in principle will only detect variants within or near causal genes, EWAS can also detect genes that confound the association between a trait and a DNA methylation site, or are reverse causal. Here we systematically compare EWAS and GWAS association results for 14 complex traits (N > 4500). A small fraction of detected genomic regions were shared by both EWAS and GWAS (0-9%). We evaluated if the genes or gene ontology terms flagged by GWAS and EWAS overlapped, and after a multiple testing correction, found substantial overlap for diastolic blood pressure (gene overlap P = 5.2 × 10⁻⁶, term overlap P = 0.001). We superimposed our empirical findings against simulated models of varying genetic and epigenetic architectures and observed that in a majority of cases EWAS and GWAS are likely capturing distinct genesets, implying that genes identified by EWAS are not generally causally upstream of the trait. Overall our results indicate that EWAS and GWAS are capturing different aspects of the biology of complex traits.

17
Using summary statistics to evaluate the genetic architecture of multiplicative combinations of initially analyzed phenotypes with a flexible choice of covariates

Wolf, J. M.; Westra, J.; Tintle, N.

2021-03-09 genetics 10.1101/2021.03.08.433979 medRxiv
Top 0.1%
22.8%

While the promise of electronic medical record and biobank data is large, major questions remain about patient privacy, computational hurdles, and data access. One promising area of recent development is pre-computing non-individually identifiable summary statistics to be made publicly available for exploration and downstream analysis. In this manuscript we demonstrate how to utilize pre-computed linear association statistics between individual genetic variants and phenotypes to infer genetic relationships between products of phenotypes (e.g., ratios; logical combinations of binary phenotypes using AND and OR) with customized covariate choices. We propose a method to approximate covariate-adjusted linear models for products and logical combinations of phenotypes using only pre-computed summary statistics. We evaluate our method's accuracy through several simulation studies and an application modeling various fatty acid ratios using data from the Framingham Heart Study. These studies show consistent ability to recapitulate analysis results performed on individual-level data, including maintenance of the Type I error rate, power, and effect size estimates. An implementation of this proposed method is provided in the publicly available R package pcsstools.

18
Partitioning Fraction of Variance Explained into Strong Localized Effects and Weak Diffuse Effects

Nan, F.; Azriel, D.; Schwartzman, A.

2026-01-07 genetics 10.64898/2026.01.06.697735 medRxiv
Top 0.1%
22.7%

High-dimensional genetic data present substantial challenges for estimating the fraction of variance explained (FVE) by genome-wide single-nucleotide polymorphisms (SNPs). Standard approaches for SNP heritability estimation, such as GWAS heritability (GWASH) and linkage disequilibrium score (LDSC) regression, typically assume Gaussian distributions for SNP effect sizes. However, empirical evidence indicates that SNP effects are often heavy-tailed, with a small subset of variants exerting disproportionately large influence. Such settings violate the recently established bounded-kurtosis effect (BKE) condition, under which these FVE estimators are consistent. Consequently, widely used methods may yield severely biased estimates when strong effects are present. We introduce a decomposed FVE estimation framework that accommodates heavy-tailed and heterogeneous SNP effect distributions. The proposed approach partitions total heritability into contributions from strong and weak genetic effects, estimating the former using low-dimensional adjusted R^2 and the latter using an extension of FVE estimation methodology that remains valid under BKE compliance. We further develop a test for detecting violations of the BKE condition and compare several high-dimensional screening procedures for identifying strong-effect SNPs when they are not known in advance. Simulation studies show that the proposed decomposition substantially improves estimation accuracy over existing approaches in the presence of heavy-tailed effects. Application to the Adolescent Brain Cognitive Development (ABCD) Study demonstrates the practical utility of the method, yielding more reliable heritability estimates for the PolyVoxel Score, a neuroimaging-based biomarker linked to iron accumulation. Our results highlight the importance of accommodating effect heterogeneity in large-scale genomic studies.
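
For readers unfamiliar with the adjusted R^2 mentioned in the abstract, the following is a minimal sketch of the standard textbook definition. It is an assumption that the strong-effect component uses exactly this form; the function name `adjusted_r2` and its arguments are illustrative and not taken from the preprint.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2:           ordinary R^2 from a linear regression with an intercept
    n_samples:    number of observations n
    n_predictors: number of predictors p (here, a small set of strong-effect SNPs)
    """
    denom = n_samples - n_predictors - 1
    if denom <= 0:
        # The correction is only defined in the low-dimensional case n > p + 1.
        raise ValueError("need n_samples > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / denom


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(adjusted_r2(0.5, 101, 10))
```

The correction shrinks R^2 more as the number of predictors approaches the sample size, which is why it is suited to a small set of strong-effect SNPs rather than to the full genome-wide panel.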

19
Robust use of phenotypic heterogeneity at drug target genes for mechanistic insights: application of cis-multivariable Mendelian randomization to GLP1R gene region

Patel, A.; Gill, D.; Shungin, D.; Mantzoros, C. S.; Knudsen, L. B.; Bowden, J.; Burgess, S.

2023-07-25 epidemiology 10.1101/2023.07.20.23292958 medRxiv
Top 0.1%
22.7%

Phenotypic heterogeneity at genomic loci encoding drug targets can be exploited by multivariable Mendelian randomization to provide insight into the pathways by which pharmacological interventions may affect disease risk. However, statistical inference in such investigations may be poor if overdispersion heterogeneity in measured genetic associations is unaccounted for. In this work, we first develop conditional F-statistics for dimension-reduced genetic associations that enable more accurate measurement of phenotypic heterogeneity. We then develop a novel extension for two-sample multivariable Mendelian randomization that accounts for overdispersion heterogeneity in dimension-reduced genetic associations. Our empirical focus is to use genetic variants in the GLP1R gene region to understand the mechanism by which GLP1R agonism affects coronary artery disease (CAD) risk. Colocalization analyses indicate that distinct variants in the GLP1R gene region are associated with body mass index and type 2 diabetes. Multivariable Mendelian randomization analyses that were corrected for overdispersion heterogeneity suggest that the bodyweight-lowering rather than the type 2 diabetes liability-lowering effects of GLP1R agonism are more likely contributing to reduced CAD risk. Tissue-specific analyses prioritised brain tissue as the most likely to be relevant for CAD risk, of the tissues considered. We hope the multivariable Mendelian randomization approach illustrated here is widely applicable to better understand mechanisms linking drug targets to disease outcomes, and hence to guide drug development efforts.

20
Testing the effectiveness of principal components in adjusting for relatedness in genetic association studies

Yao, Y.; Ochoa, A.

2019-11-29 genetics 10.1101/858399 medRxiv
Top 0.1%
22.7%

Modern genetic association studies require modeling population structure and family relatedness in order to calculate correct statistics. Principal Components Analysis (PCA) is one of the most common approaches for modeling this population structure, but nowadays the Linear Mixed-Effects Model (LMM) is believed by many to be a superior model. Remarkably, previous comparisons have been limited by testing PCA without varying the number of principal components (PCs), by simulating unrealistically simple population structures, and by not always measuring both type-I error control and predictive power. In this work, we thoroughly evaluate PCA with a varying number of PCs alongside LMM in various realistic scenarios, including admixture together with family structure, measuring both null p-value uniformity and the area under the precision-recall curves. We find that PCA performs as well as LMM when enough PCs are used and the sample size is large, and find a remarkable robustness to extreme numbers of PCs. However, we notice decreased performance for PCA relative to LMM when sample sizes are small and when there is family structure, although LMM performance is highly variable. Altogether, our work suggests that PCA is a favorable approach for association studies when sample sizes are large and no close relatives exist in the data, and a hybrid approach of LMM with PCs may be the best of both worlds.